The slide deck includes the examples in this R notebook, as well as links to cheatsheets for each tidyverse package and additional reading.
install.packages(c(
"devtools",
"tidyverse",
"palmerpenguins"))
devtools::install_github("hadley/emo")
# loading packages
library(tidyverse)
library(palmerpenguins)
library(emo)
# viewing data sets in package "palmerpenguins"
data(package = "palmerpenguins")
Let’s get data into R!
# option 1: load using URL ----
raw_adelie_url <- read_csv("https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.219.3&entityid=002f3893385f710df69eeebe893144ff")
# option 2: load using filepath ----
raw_adelie_filepath <- read_csv("raw_adelie.csv")
Lucky for us, Allison Horst compiled data from all three species together in the {palmerpenguins} package!
penguins contains a clean dataset, and penguins_raw contains raw data.
# saves package tibble into global environment
penguins <- palmerpenguins::penguins
head(penguins)
penguins_raw <- palmerpenguins::penguins_raw
head(penguins_raw)
A tibble is much like the data frame in base R, but optimized for use in the Tidyverse. Let’s take a look at the differences.
# try each of these commands in the console and see if you can spot the differences!
as_tibble(penguins)
as.data.frame(penguins)
You might see that a tibble prints differently from a data frame.
This is not so much a concern in an R Markdown file, but it is noticeable in the console. The tibble print method makes it easier to work with large datasets.
There are a couple of other main differences, namely in subsetting and recycling. Check them out in `vignette("tibble")`.
Try it out here!
vignette("tibble")
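Here is a minimal sketch of the subsetting and recycling differences mentioned above. The toy data frame `df` and its values are made up for illustration; only base R and {tibble} are used.

```r
library(tibble)

# a tiny made-up data frame and its tibble twin
df <- data.frame(species = c("Adelie", "Gentoo"), mass = c(3750, 5000))
tb <- as_tibble(df)

# subsetting one column with [ : a data frame drops to a vector,
# a tibble stays a tibble
is.data.frame(df[, "species"])
is.data.frame(tb[, "species"])

# partial matching with $ : a data frame finds `species`,
# a tibble returns NULL (with a warning, silenced here)
df$spec
suppressWarnings(tb$spec)

# recycling: tibble() only recycles length-1 vectors,
# so a length-2 column next to a length-4 column is an error
ok  <- tibble(x = 1:4, y = 1)
bad <- try(tibble(x = 1:4, y = 1:2), silent = TRUE)
inherits(bad, "try-error")
```

The stricter behavior is a design choice: a tibble would rather complain than silently guess what you meant.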
Get a full view of the penguins dataset:
View(penguins)
Or catch a glimpse:
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,…
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,…
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
## $ sex <fct> male, female, female, NA, female, male, female, mal…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
Let’s start by making a simple plot of our data!
ggplot2 uses the “Grammar of Graphics” and layers graphical components together to create a plot.
penguins %>%
ggplot()
penguins %>%
ggplot(aes(x = sex, y = body_mass_g))
penguins %>%
ggplot(aes(x = sex, y = body_mass_g)) +
geom_point()
# A scatter plot doesn't really tell us much.
# Let's try a different geometry
penguins %>%
ggplot(aes(x = sex, y = body_mass_g)) +
geom_boxplot()
# That's more informative!
# Let's see if there are differences by penguin species
penguins %>%
ggplot(aes(x = sex, y = body_mass_g)) +
geom_boxplot(aes(fill = species))
# What do you notice?
You might see:
No NAs among Chinstrap penguin data points! sex was available for each observation.
I wonder what percentage of observations are NA for each species? Let’s get the tidyverse to help us with this!
Next stop, dplyr!
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,…
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,…
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
## $ sex <fct> male, female, female, NA, female, male, female, mal…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…
Selecting dataset columns with select()
penguins %>%
select(species, sex, body_mass_g)
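select() also accepts helper functions that match columns by name. As a small sketch, ends_with() below keeps every column whose name ends in "_mm"; that suffix is just a convenient choice for this dataset.

```r
library(dplyr)
library(palmerpenguins)

# keep species plus every measurement recorded in millimeters
penguins_mm <- penguins %>%
  select(species, ends_with("_mm"))

names(penguins_mm)
```

Other helpers such as starts_with() and contains() work the same way.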
Reordering the data set with arrange()
penguins %>%
select(species, sex, body_mass_g) %>%
arrange(desc(body_mass_g))
Summarizing the data using group_by() and summarize()
We can use group_by() to group our data by species and sex, and summarize() to calculate the average body_mass_g for each grouping.
penguins %>%
select(species, sex, body_mass_g) %>%
group_by(species, sex) %>%
summarize(mean = mean(body_mass_g))
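Because body_mass_g contains a few NA values, mean() returns NA for any group that includes one. A possible variation, sketched below, drops the missing values with na.rm = TRUE; the column name mean_mass and the extra n column are illustrative choices.

```r
library(dplyr)
library(palmerpenguins)

mass_summary <- penguins %>%
  select(species, sex, body_mass_g) %>%
  group_by(species, sex) %>%
  summarize(
    mean_mass = mean(body_mass_g, na.rm = TRUE),  # ignore missing weights
    n = n(),                                      # rows per group, NAs included
    .groups = "drop"                              # return an ungrouped tibble
  )

mass_summary
```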
If we’re just interested in counting the observations in each grouping, we can group and summarize with special functions count() and add_count().
Counting can be done with group_by() and summarize(), but it’s a little cumbersome.
It involves:
1. using mutate() to create an intermediate variable n_species that adds up all observations per species, and
2. an ungroup()-ing step
penguins %>%
group_by(species) %>%
mutate(n_species = n()) %>%
ungroup() %>%
group_by(species, sex, n_species) %>%
summarize(n = n())
In contrast, count() and add_count() offer a simplified approach.
Thank you to Alison Hill for this suggestion!
penguins %>%
count(species, sex) %>%
add_count(species, wt = n,
name = "n_species")
We can add to our counting example by using mutate() to create a new variable prop, which represents the percentage of penguins of each sex within each species.
Thank you to Alison Hill for this suggestion!
penguins %>%
count(species, sex) %>%
add_count(species, wt = n,
name = "n_species") %>%
mutate(prop = n/n_species*100)
Finally, we can filter rows to only show us Chinstrap penguin summaries by adding filter() to our pipeline
penguins %>%
count(species, sex) %>%
add_count(species, wt = n,
name = "n_species") %>%
mutate(prop = n/n_species*100) %>%
filter(species == "Chinstrap")
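If the only goal is the share of missing sex values per species, one alternative sketch is to summarize is.na(sex) directly. The column names n_missing and pct_missing are made up for this example.

```r
library(dplyr)
library(palmerpenguins)

sex_na_by_species <- penguins %>%
  group_by(species) %>%
  summarize(
    n = n(),                                # observations per species
    n_missing = sum(is.na(sex)),            # how many have no recorded sex
    pct_missing = mean(is.na(sex)) * 100    # the same thing as a percentage
  )

sex_na_by_species
```

This works because is.na() returns TRUE/FALSE, and TRUE counts as 1 when summed or averaged.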
Currently the year variable in penguins is continuous from 2007 to 2009.
There may be situations where this isn’t what we want and we might want to turn it into a categorical variable instead.
The factor() function is perfect for this.
penguins %>%
mutate(year_factor = factor(year, levels = unique(year)))
The result is a new factor year_factor with levels 2007, 2008 and 2009!
penguins_new <-
penguins %>%
mutate(year_factor = factor(year, levels = unique(year)))
penguins_new
Double check the variable class and factor levels below:
class(penguins_new$year_factor)
## [1] "factor"
levels(penguins_new$year_factor)
## [1] "2007" "2008" "2009"
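One thing to watch for once year is a factor: as.integer() returns the internal level codes, not the years themselves. The short base R sketch below uses a made-up vector of years to show the difference.

```r
# a small made-up vector of years
year <- c(2007L, 2008L, 2009L, 2008L)
year_factor <- factor(year, levels = unique(year))

# level codes (positions in the levels), not the years
codes <- as.integer(year_factor)
codes

# go through character first to recover the original values
years_back <- as.integer(as.character(year_factor))
years_back
```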
Let’s play around with strings a little bit!
From what we’ve learned so far, take a guess at what this code chunk will do before running it.
penguins %>%
select(species, island) %>%
mutate(ISLAND = str_to_upper(island))
How about this one? How is it different from the previous code chunk?
penguins %>%
select(species, island) %>%
mutate(ISLAND = str_to_upper(island)) %>%
mutate(species_island = str_c(species, ISLAND, sep = "_"))
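To see what the two stringr functions do on their own, here is a small sketch outside the pipeline. The three-element vectors are made up to mirror the species and island columns.

```r
library(stringr)

# made-up stand-ins for the island (factor) and species columns
island  <- factor(c("Torgersen", "Biscoe", "Dream"))
species <- c("Adelie", "Gentoo", "Chinstrap")

# str_to_upper() returns a character vector, even for factor input
island_upper <- str_to_upper(island)
class(island_upper)

# str_c() pastes element by element, with sep between the pieces
species_island <- str_c(species, island_upper, sep = "_")
species_island
```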
Both penguin datasets are already tidy!
We can pretend that they weren’t and that body_mass_g was recorded separately for male, female, and sex NA penguins, like untidy_penguins below:
untidy_penguins <-
penguins %>%
pivot_wider(names_from = sex,
values_from = body_mass_g)
untidy_penguins
Now let’s make it tidy again with the help of the pivot_longer() function! pivot_wider() is another very popular tidying function. Have you seen it before? Hint: see the code chunk above!
untidy_penguins %>%
pivot_longer(cols = male:`NA`,
names_to = "sex",
values_to = "body_mass_g")
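The round trip is easier to follow on a tiny example. The sketch below uses a made-up three-row tibble (the id column and values are hypothetical) and adds values_drop_na = TRUE, an optional pivot_longer() argument that discards the empty cells created by widening.

```r
library(tidyr)
library(tibble)

# made-up tidy data: one row per penguin
tidy_toy <- tibble(
  id = 1:3,
  sex = c("male", "female", NA),
  body_mass_g = c(3750, 3800, 3250)
)

# widen: one column per sex value (male, female, and NA)
wide_toy <- tidy_toy %>%
  pivot_wider(names_from = sex, values_from = body_mass_g)

# lengthen again, dropping the cells that were empty in the wide form
long_toy <- wide_toy %>%
  pivot_longer(cols = male:`NA`,
               names_to = "sex",
               values_to = "body_mass_g",
               values_drop_na = TRUE)

long_toy
```

Note that the missing sex comes back as the text "NA" rather than a true missing value, because it traveled through a column name.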
Ok, we love our earlier boxplot showing us body_mass_g by sex and colored by species… but let’s change up the colors to keep with our Antarctica theme!
I’m a big fan of the color palettes in the nord 📦
Let’s turn this plot:
penguins %>%
ggplot(aes(x = sex, y = body_mass_g)) +
geom_boxplot(aes(fill = species))
Into this one!
penguins %>%
ggplot(aes(x = sex, y = body_mass_g)) +
geom_boxplot(aes(fill = species)) +
scale_fill_manual(values = nord::nord_palettes$frost)
Let’s try out the frost palette.
# we'll need to load the {nord} package
library(nord)
# you can choose colors using the color hex codes
nord::nord_palettes$frost
## [1] "#8FBCBB" "#88C0D0" "#81A1C1" "#5E81AC"
# but you might prefer to use `scale_fill_manual()`
# or more specialized functions like `scale_fill_nord()`
# included in the {nord} package
penguins %>%
ggplot(aes(x = sex, y = body_mass_g)) +
geom_boxplot(aes(fill = species)) +
scale_fill_manual(values = nord::nord_palettes$frost)
#scale_fill_nord(palette = "frost")
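The frost palette has four colors but there are only three species, so scale_fill_manual() simply uses the first three. If you would rather pin a specific color to each species, one option is a named vector, sketched below. The hex codes are the frost values printed above; which color goes with which species is an arbitrary choice for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# frost hex codes, typed out so this chunk runs without {nord}
frost <- c("#8FBCBB", "#88C0D0", "#81A1C1", "#5E81AC")

# names must match the levels of penguins$species
species_colors <- c(Adelie = frost[1], Chinstrap = frost[3], Gentoo = frost[4])

p <- ggplot(penguins, aes(x = sex, y = body_mass_g)) +
  geom_boxplot(aes(fill = species)) +
  scale_fill_manual(values = species_colors)

p
```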
Ok now for a handy package/function trio!
# we'll have to load the {prismatic} package
library(prismatic)
prismatic::color(nord::nord_palettes$frost)
## <colors>
## #8FBCBBFF #88C0D0FF #81A1C1FF #5E81ACFF
purrr’s map() function can help us iterate the prismatic::color() function over all palettes in a palette package like nord!
Note: Not all colors will show well, like polarnight below. prismatic::color() relies on a package that kinda has limited functionality in this sense (crayon). It’s doing its best :)
nord::nord_palettes %>% map(prismatic::color)
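If map() is new to you, here is a small sketch of what it does with a named list. The list `palettes` is a made-up stand-in for nord::nord_palettes: frost uses the hex codes shown earlier, and demo is a hypothetical two-color palette.

```r
library(purrr)

# a made-up named list of palettes
palettes <- list(
  frost = c("#8FBCBB", "#88C0D0", "#81A1C1", "#5E81AC"),
  demo  = c("#000000", "#FFFFFF")   # hypothetical palette
)

# map() applies a function to each element and returns a list
# with the same names
first_colors <- map(palettes, ~ .x[1])
first_colors

# map_int() does the same but returns a named integer vector
palette_sizes <- map_int(palettes, length)
palette_sizes
```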
Let’s practice in real time!
# scatterplot sequence ----
penguins %>%
ggplot() +
geom_point(aes(x = flipper_length_mm, y = bill_length_mm)) # add aesthetics
penguins %>%
ggplot() +
geom_point(aes(x = flipper_length_mm, y = bill_length_mm,
color = species)) # add color per species
penguins %>%
ggplot() +
geom_point(aes(x = flipper_length_mm, y = bill_length_mm,
color = species, shape = species)) # add shape per species
penguins %>%
ggplot() +
geom_point(aes(x = flipper_length_mm, y = bill_length_mm,
color = species, shape = species)) +
geom_smooth(aes(x = flipper_length_mm, y = bill_length_mm,
color = species))
penguins %>%
ggplot(aes(x = flipper_length_mm, y = bill_length_mm)) +
geom_point(aes(color = species, shape = species)) +
geom_smooth(aes(color = species), se = FALSE, method = "lm")
penguins %>%
ggplot() +
geom_point(aes(x = flipper_length_mm, y = body_mass_g,
color = species, shape = species))
penguins %>%
ggplot() +
geom_histogram(aes(x = flipper_length_mm))
penguins %>%
ggplot() +
geom_histogram(aes(x = flipper_length_mm, color = species))
penguins %>%
ggplot() +
geom_histogram(aes(x = flipper_length_mm, fill = species))
penguins %>%
  ggplot() +
  # position and alpha are fixed settings, so they go outside aes()
  geom_histogram(aes(x = flipper_length_mm, fill = species),
                 position = "identity", alpha = 0.5)
# install.packages("tidytuesdayR")
# remotes::install_github("thebioengineer/tidytuesdayR")
library(tidytuesdayR)
# load the data
tt_data <- tt_load("2020-07-27") # error message: 2020-07-27 was a Monday, not a TidyTuesday date
tt_data <- tt_load("2020-07-28")
tt_data <- tt_load(2020, week=31)
# take a peek
readme(tt_data)
print(tt_data)